Algorithm 3 Progressive Optimization with Center Loss
Input: The training dataset; the full-precision kernels $C$; the pre-trained kernels $tC$ from ternary PCNNs; the projection matrix $W$; the learning rates $\eta_1$ and $\eta_2$.
Output: The binary PCNNs based on the updated $C$ and $W$.
1: Initialize $W$ randomly and $C$ from $tC$;
2: repeat
3:    // Forward propagation
4:    for $l = 1$ to $L$ do
5:       $\hat{C}^l_{i,j} \leftarrow P(W, C^l_i)$; // using Eq. 3.43
6:       $D^l_i \leftarrow \mathrm{Concatenate}(\hat{C}_{i,j})$; // using Eq. 3.45
7:       Perform activation binarization; // using the sign function
8:       Perform traditional 2D convolution; // using Eqs. 3.46, 3.47, and 3.48
9:    end for
10:   Calculate the cross-entropy loss $L_S$;
11:   if using center loss then
12:      $L' = L_S + L_C$;
13:   else
14:      $L' = L_S$;
15:   end if
16:   // Backward propagation
17:   Compute $\delta_{\hat{C}^l_{i,j}} = \frac{\partial L'}{\partial \hat{C}^l_{i,j}}$;
18:   for $l = L$ to $1$ do
19:      // Calculate the gradients
20:      Calculate $\delta_{C^l_i}$; // using Eqs. 3.49, 3.51, and 3.52
21:      Calculate $\delta_{W^l_j}$; // using Eqs. 3.115, 3.116, and 3.56
22:      // Update the parameters
23:      $C^l_i \leftarrow C^l_i - \eta_1 \delta_{C^l_i}$; // Eq. 3.50
24:      $W^l_j \leftarrow W^l_j - \eta_2 \delta_{W^l_j}$; // Eq. 3.54
25:   end for
26:   Adjust the learning rates $\eta_1$ and $\eta_2$.
27: until the network converges
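To make the progressive optimization above more concrete, the following PyTorch sketch implements a single training step under simplifying assumptions: the projection $P(W, C)$ of Eq. 3.43 is reduced to a per-channel scaling followed by sign binarization, only one projection branch is used, and the discrete back-propagation rules (Eqs. 3.49–3.56) are approximated by a straight-through estimator. The names BinarySign, PCNNBlock, and train_step are hypothetical and introduced only for illustration.

```python
# Minimal sketch of one step of Algorithm 3 (not the chapter's exact formulation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class BinarySign(torch.autograd.Function):
    """Sign binarization with a straight-through estimator in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Pass gradients only where |x| <= 1 (standard STE clipping).
        return grad_out * (x.abs() <= 1).float()


class PCNNBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Full-precision kernels C; in Algorithm 3 they are initialized from the ternary tC.
        self.C = nn.Parameter(torch.randn(out_ch, in_ch, 3, 3) * 0.05)
        # Projection matrix W, modeled here as a learnable per-output-channel scale (an assumption).
        self.W = nn.Parameter(torch.ones(out_ch, 1, 1, 1))

    def forward(self, x):
        C_hat = BinarySign.apply(self.W * self.C)   # projected, then binarized kernels
        x_bin = BinarySign.apply(x)                 # activation binarization (sign)
        return F.conv2d(x_bin, C_hat, padding=1)    # traditional 2D convolution


def train_step(block, head, x, y, centers, opt_C, opt_W,
               lam_center=0.01, use_center_loss=True):
    feat = block(x).mean(dim=(2, 3))                # pooled features used by the center loss
    logits = head(feat)
    loss = F.cross_entropy(logits, y)               # cross-entropy loss L_S
    if use_center_loss:
        # Center loss L_C: squared distance to the class centers (kept fixed here for simplicity).
        loss = loss + lam_center * (feat - centers[y]).pow(2).sum(dim=1).mean()
    opt_C.zero_grad(); opt_W.zero_grad()
    loss.backward()
    opt_C.step()                                    # C <- C - eta1 * dL'/dC
    opt_W.step()                                    # W <- W - eta2 * dL'/dW
    return loss.item()
```

Two separate optimizers keep the learning rates apart, e.g. `opt_C = torch.optim.SGD([block.C], lr=eta1)` and `opt_W = torch.optim.SGD([block.W], lr=eta2)`, matching steps 23–24 of the algorithm; both rates are then decayed as in step 26.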
3.5.8 Ablation Study
Parameter As mentioned above, the proposed projection loss, similar to clustering, can control the quantization process. We computed the distributions of the full-precision kernels and visualized the results in Figs. 3.14 and 3.15. The hyperparameter λ is designed to balance the projection loss and the cross-entropy loss. We vary it from 1e-3 to 1e-5 and finally set it to 0 in Fig. 3.14, where the variance of the kernel distribution increases with λ. When λ = 0, only one cluster is obtained, with the kernel weights tightly distributed around the threshold (= 0). This could result in instability during binarization, because even a little noise may flip a positive weight to negative and vice versa.
We also show how the projection loss shapes the kernel distribution during training in Fig. 3.15. A natural question is: do we always need a large λ? For this discrete optimization problem, the answer is no, as the experiment in Table 3.4 verifies: the projection loss and the cross-entropy loss should be considered simultaneously and balanced well. For example, when λ is set to 1e-4, the accuracy is higher than with other values. Thus, we fix λ to 1e-4 in the following experiments.
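As a small illustration of the balance discussed above, the sketch below combines the cross-entropy loss with a λ-weighted projection term. The quadratic distance to the nearest quantization level is only a stand-in for the chapter's actual projection loss, and `projection_loss` and `levels` are assumptions made for illustration.

```python
import torch


def projection_loss(C, W, levels=(-1.0, 1.0)):
    """Simplified stand-in for the projection loss: squared distance of the projected
    kernels W * C to the nearest quantization level (the chapter's loss may differ)."""
    proj = W * C
    dists = torch.stack([(proj - l).pow(2) for l in levels], dim=0)
    return dists.min(dim=0).values.mean()


# lam balances the projection term against the cross-entropy loss L_S;
# Sec. 3.5.8 fixes it to 1e-4 after sweeping 1e-3, 1e-4, 1e-5, and 0.
lam = 1e-4
# total_loss = cross_entropy_loss + lam * projection_loss(block.C, block.W)
```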
Learning convergence For PCNN-22 in Table 3.2, the PCNN model is trained for 200 epochs and then used to perform inference. In Fig. 3.16, we plot the training and test losses with λ = 0 and λ = 1e-4, respectively. It clearly shows that PCNNs with λ = 1e-4 (blue